Conversation

@smarterclayton (Contributor) commented Jan 25, 2019

[sig-storage] Volume limits should verify that all nodes have volume limits [Suite:openshift/conformance/parallel] [Suite:k8s]

Introduced in rebase, https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/21860/pull-ci-openshift-origin-master-e2e-aws/2856/

@wongma7 @gnufied fyi

`[sig-storage] Volume limits should verify that all nodes have volume limits [Suite:openshift/conformance/parallel] [Suite:k8s]`
@openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Jan 25, 2019
@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jan 25, 2019
@smarterclayton added the lgtm label (indicates that a PR is ready to be merged) Jan 25, 2019
@wking (Member) commented Jan 26, 2019

Is this flaky, or is it just dead? I may have hit this on every run since the rebase ;).

@wking (Member) commented Jan 26, 2019

unit:

FAIL: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime TestCreatePodSandbox_RuntimeClass/missing_RuntimeClass 0s
FAIL: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet/kuberuntime TestCreatePodSandbox_RuntimeClass 100ms

although I don't see how that would be due to f56f493, so

/retest

@wking (Member) commented Jan 26, 2019

/lgtm

for good measure ;).

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor Author)

It succeeds extremely rarely. It just so happened that it passed on the two runs prior to the rebase, so we thought it was gravy.

@smarterclayton (Contributor Author)

/test e2e-aws

@wking (Member) commented Jan 26, 2019

Is it worth kicking until the batch job resolves?

@smarterclayton (Contributor Author)

I'm waiting for a green e2e-aws and then I'm going to force merge

@smarterclayton (Contributor Author)

/retest

@gnufied (Member) commented Jan 26, 2019

This requires kubelet 1.12 because it is a new feature that depends on the volume plugin being registered on the node. So far the flakes I have investigated related to this are still using the 1.11 kubelet, but I am still looking.
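
For reference, a rough sketch (not the actual e2e test) of what "volume limits on the node" means here: once the plugin registers, the kubelet publishes `attachable-volumes-*` resources in `node.Status.Allocatable`, and the test just checks that every node reports one. This assumes a reachable kubeconfig and uses the pre-0.18 client-go `List` signature (no context argument), roughly what origin vendors today.

```go
// Sketch only: list nodes and print any attachable-volumes-* allocatable
// resources, which is roughly what the "Volume limits" e2e test looks for.
package main

import (
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodes, err := client.CoreV1().Nodes().List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		found := false
		for name, quantity := range node.Status.Allocatable {
			if strings.HasPrefix(string(name), "attachable-volumes-") {
				fmt.Printf("%s: %s=%s\n", node.Name, name, quantity.String())
				found = true
			}
		}
		if !found {
			// A 1.11 kubelet (or an unregistered plugin) shows up here.
			fmt.Printf("%s: no attachable-volumes-* limit reported\n", node.Name)
		}
	}
}
```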

@wking (Member) commented Jan 26, 2019

e2e-aws-builds:

Failing tests:

[Feature:Builds][Slow] openshift pipeline build jenkins-client-plugin tests using the ephemeral template [Suite:openshift]

/test e2e-aws-builds

@smarterclayton (Contributor Author)

/test e2e-aws

@wking (Member) commented Jan 26, 2019

e2e-aws:

2019/01/26 03:40:18 Ran for 1m15s
error: could not run steps: could not wait for template instance to be ready: could not determine if template instance was ready: failed to create objects: object is being deleted: pods "e2e-aws" already exists

/test e2e-aws

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

@smarterclayton (Contributor Author)

/retest

@wking (Member) commented Jan 26, 2019

> I'm waiting for a green e2e-aws and then I'm going to force merge

Looking at origin's e2e-aws history, we haven't had anything pass in 11+ hours. So if you don't mind waiting it out while whatever the current batch job is fails, maybe this will just get merged without having to force it. Or maybe Tide will decide it needs to retest it in zounds of possible batch combinations, I dunno ;).

@wking (Member) commented Jan 26, 2019

It's possible this run is dying with:

level=error msg="\t* aws_iam_role.bootstrap: Error creating IAM Role ci-op-2gqyb832-55c01-bootstrap-role: EntityAlreadyExists: Role with name ci-op-2gqyb832-55c01-bootstrap-role already 

although I don't know why that would still be running after an hour. You may want to bump your commit timestamp or something to get a fresh, new namespace.

@smarterclayton (Contributor Author)

I’m more concerned that teardown isn’t running. Is this the installer backgrounding?

@smarterclayton (Contributor Author)

/test e2e-aws

@wking (Member) commented Jan 26, 2019

No teardown logs from this last run, but here's a normal teardown from a recent installer-PR failure:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1129/pull-ci-openshift-installer-master-e2e-aws/3174/artifacts/e2e-aws/container-logs/teardown.log.gz | gunzip | tail -n 3
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:security-group/sg-0b6ccc03f79dbbb5f" id=sg-0b6ccc03f79dbbb5f
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:vpc/vpc-03b6fdd1c55f21f19" id=vpc-03b6fdd1c55f21f19
level=info msg=Deleted arn="arn:aws:ec2:us-east-1:460538899914:dhcp-options/dopt-0616c23780c3426ec" id=dopt-0616c23780c3426ec

@wking (Member) commented Jan 26, 2019

My guess is the issue was job 2885 or a similar run. The setup container failed, but then the test container was killed:

level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
2019/01/26 04:08:59 Container setup in pod e2e-aws failed, exit code 1, reason Error
Another process exited
2019/01/26 04:09:14 Container test in pod e2e-aws failed, exit code 1, reason Error
{"component":"entrypoint","level":"error","msg":"Entrypoint received interrupt: terminated","time":"2019-01-26T05:21:56Z"}
2019/01/26 05:21:56 error: Process interrupted with signal interrupt, exiting in 2s ...
2019/01/26 05:21:56 cleanup: Deleting template e2e-aws

My guess is artifact gathering was slow (there was no cluster), and we reaped teardown before it completed. But I don't see any teardown logs, so it's hard to know.

@smarterclayton (Contributor Author)

I’ve noticed some of that in jobs today.

@smarterclayton (Contributor Author)

/retest

@wking (Member) commented Jan 26, 2019

We should short-circuit artifact gathering when Terraform fails. Have the installer exit 2? Grep the logs?

@wking (Member) commented Jan 26, 2019

Ah, or gate on some really-basic API call succeeding. I can work that up tomorrow if you don't beat me to it ;).
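
Something like this, as a minimal sketch of the "gate on a basic API call" idea; the `KUBE_API_URL` variable and the exit code are placeholders, not what the CI template actually uses:

```go
// Hedged sketch: before gathering artifacts, make one cheap request against
// the API server's /healthz endpoint and bail out if nothing answers, so a
// failed Terraform run doesn't stall in artifact gathering.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func clusterReachable(apiURL string) bool {
	client := &http.Client{
		Timeout: 10 * time.Second,
		// The bootstrap cluster serves a self-signed cert; skip verification
		// for this reachability probe only.
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := client.Get(apiURL + "/healthz")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	if !clusterReachable(os.Getenv("KUBE_API_URL")) {
		fmt.Println("API server unreachable; skipping artifact gathering")
		os.Exit(2)
	}
	fmt.Println("API server healthy; proceeding with artifact gathering")
}
```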

@smarterclayton (Contributor Author)

I’ll probably be looking at other things so don’t worry about me also looking at it :)

@smarterclayton (Contributor Author)

What the fork.

@smarterclayton merged commit 9122b3d into openshift:master Jan 26, 2019
@smarterclayton (Contributor Author)

One risk is that an API failure flake could result in no logs. We should definitely consider something like the gather queue terminating early if enough sequential things aren't gathered.
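
A minimal sketch of that early-termination idea (the names and threshold are hypothetical, not existing CI code):

```go
// Walk the list of artifacts to gather and give up after too many
// consecutive failures, so an API outage doesn't stall the job until the
// global timeout.
package main

import (
	"errors"
	"fmt"
)

const maxConsecutiveFailures = 3

func gatherAll(artifacts []string, gather func(string) error) error {
	failures := 0
	for _, name := range artifacts {
		if err := gather(name); err != nil {
			failures++
			fmt.Printf("failed to gather %s: %v\n", name, err)
			if failures >= maxConsecutiveFailures {
				return errors.New("too many consecutive gather failures; API is probably down")
			}
			continue
		}
		failures = 0 // reset on any success
	}
	return nil
}

func main() {
	// Stand-in gatherer that always fails, to show the short-circuit.
	err := gatherAll([]string{"nodes", "pods", "events", "logs"}, func(string) error {
		return errors.New("connection refused")
	})
	fmt.Println(err)
}
```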

@openshift-ci-robot

@smarterclayton: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
--- | --- | --- | ---
ci/prow/e2e-aws | f56f493 | link | /test e2e-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

`openshift mongodb replication creating from a template`, // flaking on deployment
`should use be able to process many pods and reuse local volumes`, // https://bugzilla.redhat.com/show_bug.cgi?id=1635893

`[sig-storage] Volume limits should verify that all nodes have volume limits`, // flaking due to a kubelet issue
Member

I saw this again in a job kicked off after the merge. Maybe the leading [sig-storage] here is a problem? The entries above don't seem to have those.

@smarterclayton (Contributor Author) commented Jan 26, 2019

Yeah, it needed to be regex quoted. Manually checked and merged a follow-up.
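
For anyone following along, a small illustration of the quoting issue (not the follow-up's exact code): the exclusion entries are compiled as regular expressions, so an unescaped `[sig-storage]` is read as a character class and the entry never matches the test name.

```go
// Unquoted vs. regexp.QuoteMeta'd exclusion entry.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	name := "[sig-storage] Volume limits should verify that all nodes have volume limits"

	raw := regexp.MustCompile("[sig-storage] Volume limits should verify that all nodes have volume limits")
	quoted := regexp.MustCompile(regexp.QuoteMeta(name))

	fmt.Println(raw.MatchString(name))    // false: "[sig-storage]" only matches a single character
	fmt.Println(quoted.MatchString(name)) // true: the bracketed tag is escaped and matches literally
}
```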

Member

Cross-linking #21867.
